Note: I’m ignoring Taiwan based on my previous assignment.

1.

Density plots of the median age in developing and developed countries:

library(dplyr)
library(ggplot2)
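# Read "." as NA at import time; calling na_if() on a whole data frame
# can error in recent dplyr versions (1.1 and later)
data <- read.csv("final_dataset.csv", sep=";", na.strings=c("NA", ".")) %>% 
  filter(country != 'Taiwan')
# Make sure the rate is numeric even if it was read as character or factor
data$youth_unempl_rate <- as.numeric(as.character(data$youth_unempl_rate))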
ggplot(data) + 
  geom_density(aes(x=median_age, fill=dev), alpha=0.5) +     
  theme(legend.title=element_blank(),legend.position="top") +
  xlab("Median age of population")

Developed countries tend to have a higher median age, which is consistent with longer life expectancy and lower fertility rates. In contrast, the density plot for the developing countries shows a wider spread of median ages, with most countries concentrated at the younger end.

2.

Density plots of the youth unemployment rate in developing and developed countries:

ggplot(data) + 
  geom_density(aes(x=youth_unempl_rate, fill=dev), alpha=0.5) +     
  theme(legend.title=element_blank(),legend.position="top") +
  xlab("Youth unemployment rate of population")

Based on the graph, developing countries tend to have a higher youth unemployment rate than developed countries. This means that a lot of young people in developing countries are out of work and looking for a job. Developing countries might not have as many job opportunities available for young people, and they might not have good access to education or training that could help them get a job.

3.

Stacked barplots of absolute frequencies showing how the entities are split into regions and development status: (Note: because the assignment text is ambiguous, I plot it both ways.)

ggplot(data, aes(fill=dev, x=region)) + 
    geom_bar(position="stack") + 
    theme(legend.title=element_blank(),legend.position="top") 

ggplot(data, aes(fill=region, x=dev)) + 
    geom_bar(position="stack") + 
    theme(legend.title=element_blank(),legend.position="top") 

Stacked barplots of relative frequencies showing how the entities are split into regions and development status:

ggplot(data, aes(fill=dev, x=region)) + 
    geom_bar(position="fill") + 
    theme(legend.title=element_blank(),legend.position="top")

ggplot(data, aes(fill=region, x=dev)) + 
    geom_bar(position="fill") + 
    theme(legend.title=element_blank(),legend.position="top") 

Depending on the goal of a visualization, one may choose between relative and absolute frequencies for a stacked bar plot. In most cases, I’d say the relative version is more insightful, because when comparing groups we are usually more interested in the ratios within each group than in the actual counts. Absolute frequencies can be misleading when the groups differ in size.
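To make the difference concrete, here is a minimal sketch of the numbers that the two kinds of stacked bar encode. The `toy` data frame is invented purely for illustration; only the column names `region` and `dev` are taken from the dataset above.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  region = c("A", "A", "A", "A", "B", "B"),
  dev    = c("Developed", "Developing", "Developing", "Developing",
             "Developed", "Developing")
)

freq <- toy %>%
  count(region, dev, name = "n") %>%   # absolute frequency (position="stack")
  group_by(region) %>%
  mutate(share = n / sum(n)) %>%       # relative frequency (position="fill")
  ungroup()

print(freq)
```

Both regions contain exactly one developed entity, so the absolute bars look alike in that segment, while the shares (one in four versus one in two) show that the composition differs.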

4.

Relationship between median age and youth unemployment rate:

ggplot(data, aes(x=median_age, y=youth_unempl_rate, color=dev)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  theme(legend.title=element_blank(),legend.position="top") 

There is a relationship between youth unemployment rate and median age. It seems that the higher the median age of a country, the higher the youth unemployment rate. One could construct a reasonably accurate classifier of developed and developing countries based only on these two variables.

Looking at the relationship within each development group, the fitted lines suggest a stronger positive correlation between the two variables for developed countries. The correlation is less pronounced for developing countries.
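The visual impression could be checked numerically with a per-group Pearson correlation. The sketch below is one way to do it; `cor_by_dev` is a helper name introduced here, and the `toy` data frame is made up only to show the call. With the real data the call would be `cor_by_dev(data)`.

```r
library(dplyr)

# Pearson correlation of median age and youth unemployment within each
# development group. `df` needs columns dev, median_age, youth_unempl_rate.
cor_by_dev <- function(df) {
  df %>%
    filter(!is.na(median_age), !is.na(youth_unempl_rate)) %>%
    group_by(dev) %>%
    summarize(r = cor(median_age, youth_unempl_rate),
              n = n(),
              .groups = "drop")
}

# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  dev = rep(c("Developed", "Developing"), each = 4),
  median_age = c(38, 40, 42, 44, 18, 20, 22, 24),
  youth_unempl_rate = c(10, 12, 14, 16, 20, 15, 22, 17)
)

res <- cor_by_dev(toy)
print(res)
```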

5.

Parallel boxplots of the youth unemployment rate for each region:

ggplot(data, aes(x=youth_unempl_rate, fill=region)) + 
    geom_boxplot() +
    theme(legend.title=element_blank(),legend.position="top") 

The medians seem to be quite similar. Africa differs from the other regions in the spread of its youth unemployment rates. At the opposite extreme is Europe, with the smallest spread. However, the differences don’t seem striking.
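The medians and spreads read off the boxplots could also be tabulated. The following is a sketch under the assumption that the interquartile range is an adequate measure of spread here; `spread_by_region` is a helper name introduced for illustration and the `toy` values are made up. With the real data the call would be `spread_by_region(data)`.

```r
library(dplyr)

# Median and interquartile range of the youth unemployment rate per region,
# sorted from the widest to the narrowest spread.
spread_by_region <- function(df) {
  df %>%
    filter(!is.na(youth_unempl_rate)) %>%
    group_by(region) %>%
    summarize(med = median(youth_unempl_rate),
              iqr = IQR(youth_unempl_rate),
              .groups = "drop") %>%
    arrange(desc(iqr))
}

# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  region = rep(c("X", "Y"), each = 5),
  youth_unempl_rate = c(5, 10, 15, 20, 25, 13, 14, 15, 16, 17)
)

out <- spread_by_region(toy)
print(out)
```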

6.

Parallel boxplots of the median age for each region:

ggplot(data, aes(x=median_age, fill=region)) + 
    geom_boxplot() +
    theme(legend.title=element_blank(),legend.position="top")

The median age boxplots look quite different from the youth unemployment rate boxplots. Again, Africa and Europe stand out as the two extremes, with median ages that differ by around 20 years.

7.

Median youth unemployment rate per subregion:

library(forcats)
cbbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#CC79A7")
data %>% 
  filter(!is.na(youth_unempl_rate)) %>%
  group_by(subregion) %>%
  # Return one row per subregion; assumes each subregion lies in a single region
  summarize(medYUR = median(youth_unempl_rate), region = first(region)) %>% 
  ggplot(aes(x=medYUR, y=fct_reorder(subregion, medYUR), color=region)) + 
  geom_point() + 
  scale_colour_manual(values=cbbPalette) +
  xlab("Median youth unemployment rate of population") + 
  ylab("Sub-region")

8.

Loading the downloaded World Bank population data:

population <- read.csv('API_SP.POP.TOTL_DS2_en_csv_v2_5358404.csv', skip=4)

Merging the main dataset with the population dataset:

joined <- data %>% left_join(population, by = c('ISO.3166.3'='Country.Code'))
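A left join keeps every row of the main dataset, so entities whose code has no match in the population file silently get `NA` population values. One way to list them is an anti join on the same key. This is a sketch: `unmatched_codes` is a helper name introduced here, and the toy frames (including the country names and codes) are made up. With the real data the call would be `unmatched_codes(data, population)`.

```r
library(dplyr)

# Rows of `left` whose ISO code has no counterpart in `right`;
# these are the rows that end up with NA population after left_join().
unmatched_codes <- function(left, right) {
  anti_join(left, right, by = c("ISO.3166.3" = "Country.Code"))
}

# Hypothetical toy data (names, codes and values invented for illustration)
main_toy <- data.frame(country = c("Aland", "Borduria"),
                       ISO.3166.3 = c("AAA", "BBB"))
pop_toy  <- data.frame(Country.Code = "AAA", X2020 = 1000)

no_match <- unmatched_codes(main_toy, pop_toy)
print(no_match)
```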

9.

Relationship between median age and youth unemployment rate, with point size indicating the 2020 population:

library(plotly)
gpl <- ggplot(joined, aes(x=median_age, y=youth_unempl_rate, color=dev, size=X2020, text=country)) +
  geom_point()

ggplotly(gpl, tooltip = c("text", "x", "y", "size"))